Index Site: http://www.fh-furtwangen.de/~webers/index.htm
The CW - Algorithm
by Thomas Carl Cierzynski and Stefan von Weber
Many investigations have been carried out to find stable algorithms for the use in statistical applications (Bai 1988, Cook/ Weisberg 1994, Läuter 1994, Wegscheider 1984). The CW-algorithm (Cierzynski/ v.Weber 1990) was originally developed for the use in regression analysis. There it reduces about 10% of the mean quadratic error of prediction. The use of the CW-algorithm is practicable if data are prone to relatively great errors, if more variables exist than individuals and the variables are highly correlated (Cierzynski/v.Weber 1992). On the other hand, for theoretical reasons, the use of the CW-algorithm in the discriminant analysis e.g. cannot deliver results as good as those for the regression analysis. The reason is the lack of the dominating effect of CWA, the inherent variable selection. However the positive effect, that similar variables are averaged instead of subtracted, results in a more stable solution in this case also.
Let be XX the covariance matrix with p rows and p columns and let be SS an unit matrix of the same dimension. The CW-algorithm given as procedure in the Turbo Pascal language delivers the pseudo inverse matrix XX-1 on the place SS.
PROCEDURE CWA_Inversion
(p:integer;VAR xx,ss:rDDmat);
VAR rel,s,c1,uu,v,umax,vmax,du:real;
maxstep,imstep,i,j,k,kp,ka,ip,jp:integer;
b,wg,wh:rDvek;
begin
rel:=1.01; { To scatter similar variables }
s:=0.1; { Step-width }
maxstep:=round(p/s); {Number of steps }
{ Take row kk of matrix ss }
for kk:=1 to p do
begin
{ Initial values for b,wg, wh }
for i:=1 to p do
begin
b[i]:=0; { New row of matrix ss }
wg[i]:=-1e25; { auxiliary vector }
wh[i]:=ss[kk,i]; {"covariance vector" }
end;
k:=0;
for imstep:=1 to maxstep do
begin
{Seek row with maximal gain of variance}
ka:=k;
vmax:=-1e25;
for jp:=1 to p do
begin ip:=ka+jp; if ip>p then ip:=ip-p;
if wg[ip]<>0 then
begin uu:=wh[ip]/xx[ip,ip];
v:=wh[ip]*uu;
if (v>vmax) then
begin k:=ip;
vmax:=v*rel;
umax:=uu;
end;
end;
end;
{"linear regression coefficient" }
uu:=umax;
if b[k]=0 then wg[k]:=abs(uu)*s;
du:=abs(uu);
if du>wg[k] then du:=wg[k];
if uu<0 then du:=-du;
{ update covariance vector wh }
for kp:=1 to p do
if wg[kp]<>0 then
wh[kp]:=wh[kp]-du*xx[kp,k];
{ update row of ss *)}
b[k]:=b[k]+du;
end; { end of imstep-loop }
(* Store b in matrix ss *)
for i:=1 to p do ss[kk,i]:=b[i];
end;
end; { End of procedure CWA-Inversion }
The three most important ideas of the CW-algorithm are:
i) In each step only that coefficient is changed which delivers maximal gain of variance.
ii) The damping factor s<<1, e.g. s=0.1, enables the averaging of similar variables.
iii) By limitation of the number of steps the Gauss-Jordan solution is prevented.
References
Bai, Zhao Jun
(1988): Numerical Treatment of restricted Gauss-Markov model, Comm.Statist.Simulation Comput. 17 (1988) no.2, 569-579Cierzynski,T.C, v.Weber,S. (1990): Simulation Experiments with a new stable Regression Algorithm. J. Syst. Analysis, Modelling and Simulation, Vol 7 (2)
Cierzynski,T.C, v.Weber,S. (1992): Simulation Experiments with CWA, a New stable Regression Algorithm and Comparision with two other stable Regression Algorithms, Computational Statistics and Data Analysis(13) 4
Cook,R.D., Weisberg, S. (1994): An Introduction to Regression Graphics, John Wiley & Sons, New York
Fahrmeir,L.,Hamerle,A. (1984): Multi-variate statistische Verfahren, Walter de Gruyter, Berlin
Läuter,J. (1994): Stabile multivariate Ver-
fahren, Akademieverlag
Wegscheider, K. (1986): Invarianz-eigenschaften von Diskriminanzanalyse-verfahren ,. Studien zur Angewandten Wirtschaftsforschung und Statistik, 19. Vandenhoeck & Ruprecht, Göttingen
Simulation Experiments with
CWA,
a New stable Regression Algorithm,
and
Comparisons with Two other Stable Regression Algorithms
by
Thomas C. Cierzynski and Stefan v. Weber
Karl-Weierstrass-Institute of Mathematics Berlin
and Fachhochschule Furtwangen
1. Introduction
The common model of linear regression with the dependent variable Y and p independent variables X1,...,Xp can be described by
yi = b0 + b1 xi1 + ... + bp xip + ei (1)
where (Y) is the vector of the observed values yi of the dependent variable, (X) is the data matrix of values xij of the independent variables (sample plan), b0 is the intercept term or regression constant, (B) is the vector (b1,b2,...,bp)' of the regression coefficients and (E) is the vector of the observation errors or residuals ei respectively (Schoenberg 1971, Weisberg 1986).
The two most important aims of linear regression analysis are:
i) Forecasting or prediction of new cases, i.e. estimation of the regression coefficients in such a way that
for a given new point (xm1,...,xmp)' the estimated ym has a minimal error with respect to the
observed ym.
ii) Identification of a model, i.e. the selection of suitable X-variables.
The multicollinearity of X-variables is a special problem in regression analysis. On the one side the variance of the estimated regression coefficients increases with growing multicollinearity and becomes infinite in the case of linear dependent X-variables (Jahn 1986, Weisberg 1986). On the other side the MSE (Mean Square Error) of prediction is not influenced by multicollinearity if the suppositions for the linear regression model are fulfilled (Goldstein 1984, Weisberg 1986). In practice we often have to deal with incorrect models and other deviations from the assumptions, and so multicollinearity disturbs prediction of responses or identification of models.
The latter problem is commonly solved by selection of variables (stepwise regression) or by changing the estimated covariance matrix (Ridge regression, restrictions, Schoenberg 1971, Weisberg 1986). J. Läuter (1987 a,b) remarks that the selection of variables is annihilating an essential part of the present information. He proposes new algorithms which avoid strong selection of variables as one way to achieve more stability.
Stability cannot be reached in all cases, but to get better results one can try to use special properties which can often be found in real data populations. J. Läuter has noted an undisclosed reserve in the handling of highly correlated variables which can be interpreted as consequences of an underlying (unknown) factor structure (Läuter 1987c).
In this paper our aim is to stabilize the regression in the case of highly correlated variables and relatively small sample sizes with respect to high precision of prediction. For this we developed two new regression procedures and compared them with the classical stepwise procedure by simulation studies.
2. CWA regression
The common way to estimate linear regression coefficients is the stepwise selection of independent variables avoiding high multicollinearity in the chosen set. The stepwise algorithm is a derivative of the Gaussian elimination procedure for the solution of systems of linear equations (Weisberg 1986). For an independent variable Xj only two alternatives exist with respect to the resulting regression model - it was included or not.
The following algorithm is an alternative to this.
Let there be N vectors (yi,xi1,xi2,...,xip)' with i=1,...,N.
yi is the value of the independent variable and
xi1,...,xip are the values of the p independent variables (X1),...,(Xp).
y. denotes the arithmetic mean value of the yi,
x.j denotes the arithmetic mean value of the independent variable Xj with j=1,...,p,
(XY) denotes the vector of the sums of products
(xij-x.j) ( yi-y.) over i=1,...,N for j=1,...,p,
(XX) denotes the matrix of sums of squares and products (xij-x.j)(xik-x.k)
over i=1,...,N for j=1,...,p and k=1,...,p.
The solution of the system of linear equations
(2) (XX) (?) = (XY)
delivers the GAUSS-MARKOV solution of the regression problem, which is a solution without selection of variables. In our algorithm (CWA) we use an iterative procedure comparable to the GAUSS-SEIDEL algorithm. The three main ideas are (Cierzynski & v. Weber 1990):
i) In each step of iteration we only change that coefficient bk which optimizes the convergence in some
sense.
ii) Using a damping factor d<<1, e.g. d=0.1, we want toprevent the overswing of the
intermediate solutions.
iii) By a given upper limit m of iteration steps we prevent the inclusion of too much independent
variables into the final model.
Starting with an initial solution (F)=0 for the wanted estimation of the regression coefficients (B) the CWA-algorithm improves (F) step by step in such a way that
(3) (XX) (B-F) = (XY) - (XX)(F) = (H)
is converging to the zero vector and (F) to an estimation of (B). At the starting point of the iteration (H) is equal to (XY). In each step of the iteration the univariate regression coefficients of the residues
(4) Uj = Hj / (XX)jj for j=1,...,p
and the corresponding shares in the sum of squared residues
(5) Vj = Uj Hj for j=1,...,p
are computed. CWA seeks the solution in direction of the maximal Vj. Let q be the index of the Vj with the greatest absolute value. The q-th component of (F) is improved by
(6) F'q = Fq + d Uq.
Here the damping factor d is a positive number within the interval (0,1), e.g. 0.1. Then the new H'j are calculated by
(7) H'j = Hj - d Uq (XX)jq for j=1,...,p.
The value d=0.1 we used in most cases. But we propose some greater value (0.2 or 0.3) to save computer time if the number of X-variables exceeds 100.
We remark that we can also iterate a 'pseudoinverse' of the `correlation matrix` (Cxx) which is defined by its elements
(Cxx)jr = (XX)jr / (Wj Wr)
where the Wj are the square roots of the diagonal elements (XX)jj of (XX). That is, (Cxx)jr can be interpreted as the correlation coefficients between the (deterministic) Variables Xj and Xr. The 'pseudoinverse' will be computed simultaneously and analoguously to the estimation of the regression coefficients. Starting with
(8) (I) - (Cxx) (C) = (D).
with the initial setting (C)=0 we improve (C) in each step in the following way: Let q be the above defined index and r the row index with the greatest absolute value in the q-th column of (D). Then if holds that
(9) C'rq = Crq + d Drq / (Cxx)rr
since (Cxx)rr=1 this substituted in (8) results into (D') where
(10) D'jq = Djq - (Cxx)jr d Drq for j=1,...,p.
Simulations showed that (C) is converging to the inverse of (Cxx) if (Cxx) is of full rank. What kind of ´pseudoinverse´ (C) represents will be the subject of later investigations.
The approximation of (B) and the 'pseudoinverse' is stopped after m steps. We find m by setting
(11) m = ((1/d) ln(N)) / ln(g).
where g is a positive number. g is the minimum number of points needed to represent statistically one dimension of the p-dimensional space. Our test with several data sets and several values of g showed that g=3 is giving the best results.
The philosophy of condition (11) is that one should not use more than m*d 'effective variables' approximating a hyperplane in a sample of N points. For examples use only one 'effective variable' in the case of N=3 points or two 'effective variables' in the case of N=9 points. One may translate the heuristic notion 'effective variable' as an uncorrelated variable or factor. If the variables Xj are not correlated 1/d steps will involve one 'effective variable' in the model and m steps then m*d 'effective variables' (Cierzynski & v. Weber 1990).
3. The Resampled Average Modelling Method (RAMM)
The RAMM using K randomly selected samples generated by one of the common resampling methods (Lachenbruch/ Mickey 1965) and applying to them the common stepwise regression method you get K models Mk with k=1,...,K. The models are in general different in number, identity and weights of the independent variables. Now taking into account the additivity of the linear regression model and that each model Mk describes the original sample well it seems to be reasonable to estimate the regression coefficients as the arithmetic mean of the estimated regression coefficients over the models Mk. This is just what RAMM is doing.
This averaging procedure can in a formal way also be done with the correlation matrix, the inverse correlation matrix and the degrees of freedom.
After summation and averaging one may delete the rows and columns of missing variables in the matrices or not. One can calculate the further statistics of the regression analysis (intercept term, standard deviations of the regression coefficients, tests) in the common way. We emphasize once more the heuristic concept of the RAMM and that a theoretical foundation is outstanding. We refer here to the stepwise regression with a similar state of art.
Some properties of the common least squares estimator (LSE) are lost. The residuals in general will not have the expectation 0 and the inverse correlation matrix C as averaged sum of K different inverse matrices is in general not the inverse matrix of the averaged sum of the correlation matrices.
4. Data generation for the Monte-Carlo study
Monte-Carlo studies are legitimate tools to verify heuristic hypotheses. They use random samples of natural data sets or artifical data. In this study we used artifical data only.
Each of the generated data matrices consists of 41 columns and N rows. The first column was always reserved for the dependent variable Y. A value yi was calculated by
(12) yi = b1 Fi1 + b2 Fi2 + 0.25 N(0;1), i=1,...,N.
Here Fik are realizations of two factors F1 and F2 while b1 and b2 are constants differing from data matrix to data matrix. Additionally we use a third factor F3 to design a special factor structure of the independent variables Xj.
In order to generate a data matrix we first generated the b1 and b2 coefficients as well as the replications Fik by means of a generator of normally distributed random numbers, with mean zero and variance one.
Each of the 40 independent variables Xj is associated with one of the 3 factors Fk. This association is defined by the random vector K. One xij is calculated by
(13) xij = Aj Fi,Kj + (1 - Aj ) N(0;1), j=1,...,40
where
(14) Aj = 0.91 - 0.01 j.
This means the correlation between the independent variable Xj and the related factor FKj decreases with increasing j systematically. The dependent variable Y is correlated via the factors F1 and F2 more or less with about 2/3 of the X-variables. To reduce systematic errors a dummy sequence of N(0;1)-randomnumbers of length Sz was generated before starting the data generation. We coupled Sz and N in such a way that holds Sz=10 N + m with m=1,3,6 for the 3 repetitions of each generation run to a fixed N. For example Sz has the values 91, 93 and 96 for N=9.
The resampling method used was the following: we have randomly chosen 10% but at least one element of the sample as the working sample and the rest as the learning sample. This was done for the original sample 10 times independently.
As in the generation of the data matrices a similar dummy sequence of random numbers of constant length 17 was used. A more theoretical and general view of resampling methods in regression analysis can be found in in Bunke & Droge 1984.
5. Numerical results
Table 1 contains the compressed results of the simulation runs. The following notations are used:
C the CWA regression,
R the RAMM regression,
S the stepwise regression (standard method),
N the number of cases in the original sample,
V the estimation of the standard error of forecasting by
P t he estimation of the standard error of forecasting using an independent sample
of 1000 additionally generated cases,
U the estimation of the standard error of forecasting by resubstitution in the original
sample (mean residual).
C-V the combination of CWA regression and V-error estimation and so on.
Table 1: Results of the simulation experiments
CWA regression RAMM regression Stepw. regression
N C-V C-P C-U R-V R-P R-U S-V S-P S-U
----------------------------------------------------------------------------------------
5 0.629 0.568 0.156 0.743 0.555 0.460 1.220 0.664 0.148
6 2.362 0.982 0.384 2.316 1.462 1.227 3.331 0.890 0.143
40 0.332 0.304 0.244 0.375 0.327 0.224 0.382 0.371 0.238
45 0.341 0.311 0.257 0.300 0.328 0.249 0.327 0.334 0.255
50 0.335 0.315 0.251 0.362 0.313 0.251 0.366 0.313 0.331
60 0.320 0.327 0.242 0.301 0.323 0.214 0.349 0.339 0.216
70 0.265 0.281 0.219 0.266 0.286 0.218 0.276 0.290 0.224
80 0.288 0.310 0.251 0.287 0.310 0.227 0.290 0.333 0.227
90 0.298 0.292 0.261 0.308 0.315 0.265 0.322 0.322 0.268
100 0.286 0.307 0.260 0.282 0.302 0.258 0.289 0.316 0.264
S-V and S-P are the mean values of 6 original values (3 repetitions from the two comparing runs CWA with stepwise regression and RAMM with stepwise regression). Each other value is the arithmetic mean of the 3 repetitions. The second row of Table 1 (N=6) gives a typical example of a worse case. Here the selection of one important point led to extreme results.
Table 2: Compressed P, settled squares and percentages
in relation to the stepwise regression.
mean residuals settled squares perc.
N ! C-P R-P S-P ! C-P R-P S-P ! C% R%
--------------------------------------------------------------------------------------------
5- 10 ! 0.633 0.721 0.704 ! 0.338 0.457 0.433 ! 78 106
12- 20 ! 0.468 0.418 0.479 ! 0.157 0.112 0.167 ! 94 67
25- 50 ! 0.323 0.331 0.356 ! 0.042 0.047 0.064 ! 66 73
60-100 ! 0.303 0.307 0.320 ! 0.029 0.032 0.040 ! 73 80
--------------------------------------------------------------------------------------------
5-100 ! 0.436 0.452 0.471 ! 0.128 0.142 0.159 ! 81 89
The mean gain of reducible squares of the CWA regression is 119% and of the RAMM regression 111% in relation to the stepwise regression with 100%.
Since we can calculate P only in the case of artifical data in cases of natural data sets we are forced to decide by other and less objective error estimations, e.g. the V- or U-estimation.
6. Conclusions
The numerical results from chapter 5 allow us to draw the following conclusions for the case of factorial data structures similar to that used in our study:
i) The probability of getting a better prediction is higher using CWA regression or RAMM regression rather then stepwise regression
ii) The CWA regression reduces additionally 19% of the reducible error variance if we set the reducible error of the stepwise regression to 100%. The RAMM regression reduces additionally 11%.
iii) If one wants to know for a given data set whether the CWA regression or stepwise regression is better one should use the 'cross-validated' error estimation (V-estimation). But in the case of comparison between RAMM regression and stepwise regression one should use the error estimation by resubstitution (U-estimation).
References
Ahrens,H.;Läuter,J.(1981): Mehrdimensionale Varianzanalyse. Akademie-Verlag Berlin
Bunke,O.;Droge,B.(1984): Bootstrap and cross-validation estimates of the prediction error for linear regression models. Ann. Statist. 12, 1400-1424
Cierzynski,Th.;v.Weber,S.(1990): Simulation Experiments with a new stable Regression algorithm. J. Syst. Analysis, Modelling and Simulation, Vol 7 (2)
Cierzynski, Thomas C., Klaus Hermann and Stefan von Weber(1991): Simulation Experiments with three
stable Regressionalgorithms. Preprint Karl Weierstrass-Institute of Mathe-matics, Berlin
Goldstein,M.;Dillon,W.(1984): Multivariate analysis methods and applications. Wiley, New York,
Jahn,W.(1986): Modellwahl für die Regressionsanalyse mit stochastischen Regressoren und hohem Grad an Multikollinearität. Forschungsbericht der KMU Leipzig, Sektion Mathematik
Karian, Z. and Dudewicz, E. (1991): Modern statistical systems and GPSS simulation: the first course. Computer Sciences Press, Freeman and Company, New York
Lachenbruch, P.A. and M.R. Mickey (1965): Estimation of the multiple correlation coefficient and mean square prediction error. Paper presented at Biometric Society meetings, WNAR, Riverside, California
Läuter,J.(1987a): Stable Decisions in Discriminant, Regression and Factor Analysis. Proceedings of the International Workshop on Model-Oriented Data Analysis, Eisenach, GDR
Läuter,J.(1987b): The Problems of the Statistical Stability in the Discriminant Analysis. Seminar Data Analysis (1987) Varna, Bulgarien
Läuter,J.(1987c): Stabilisierung multivariater statistischer Verfahren. Symposium Charite "Data Analysis" der Charite, Berlin
Schoenberg,P.(1971): Methoden der Ökonometrie, Bd. II, Verlag Franz Vahlen, Mnchen
Weisberg,S.(1986): Applied linear regression, John Wiley & Sons, New York
Summary
In a Monte-Carlo study the authors compared CWA, RAMM and stepwise regression. CWA and RAMM work in cases with a high amount of multicollinearity in the X-variables with higher accuracy of prediction. To decide whether use CWA or stepwise regression the authors recommend a cross-validated error estimation. To decide whether use RAMM or stepwise regression one should use the error estimation by resubstitution.
Index Site: http://www.fh-furtwangen.de/~webers/index.htm